Abstract: The information in an individual finite object (like
a binary string) is commonly measured by its Kolmogorov 
complexity. One can divide that information into two parts: 
the information accounting for the useful regularity present in 
the object and the information accounting for the remaining 
accidental information. There can be several ways (model classes) 
in which the regularity is expressed. Kolmogorov has proposed 
the model class of finite sets, generalized later to computable 
probability mass functions. The resulting theory, known as 
Algorithmic Statistics, analyzes the algorithmic sufficient statistic 
when the statistic is restricted to the given model class. However, 
the most general way to proceed is perhaps to express the 
useful information as a recursive function. The resulting measure 
has been called the "sophistication" of the object. We develop 
the theory of recursive function statistics: we establish the maximum and
minimum values of sophistication and the existence of absolutely nonstochastic
objects (objects of maximal sophistication, in which all the information is
meaningful and there is no residual randomness), and we determine the
relation of sophistication to the more restricted model classes of finite sets and
computable probability distributions, in particular with respect
to the algorithmic (Kolmogorov) minimal sufficient statistic,
as well as its relation to the halting problem and further algorithmic
properties.

Index Terms: constrained best-fit model selection, computability, lossy
compression, minimal sufficient statistic, non-probabilistic statistics,
Kolmogorov complexity, Kolmogorov structure function, sufficient
statistic, sophistication



I. Introduction 

The information contained by an individual finite object 
(like a finite binary string) is objectively measured by its 
Kolmogorov complexity — the length of the shortest binary 
program that computes the object. Such a shortest program 
contains no redundancy: every bit is information; but is it 
meaningful information? If we flip a fair coin to obtain a 
finite binary string, then with overwhelming probability that 
string constitutes its own shortest program. However, also 
with overwhelming probability all the bits in the string are 
meaningless information, random noise. On the other hand, 
let an object x be a sequence of observations of heavenly
bodies. Then x can be described by the binary string pd,
where p is the description of the laws of gravity, and the 
observational parameter setting, while d is the data-to-model 
code accounting for the (presumably Gaussian) measurement 
error in the data. This way we can divide the information in x 
into meaningful information p and data-to-model information 
d. 
The main task for statistical inference and learning theory
is to distil the meaningful information present in the data. The 
question arises whether it is possible to separate meaningful 
information from accidental information, and if so, how. 

In statistical theory, every function of the data is called 
a "statistic" of the data. The central notion in probabilistic 
statistics is that of a "sufficient" statistic, introduced by the 
father of statistics R.A. Fisher [4]: "The statistic chosen should 
summarise the whole of the relevant information supplied by 
the sample. This may be called the Criterion of Sufficiency 
... In the case of the normal curve of distribution it is evident 
that the second moment is a sufficient statistic for estimating 
the standard deviation." For traditional problems, dealing with 
frequencies over small sample spaces, this approach is appro- 
priate. But for current novel applications, average relations are 
often irrelevant, since the part of the support of the probability 
density function that will ever be observed has about zero 
measure. This is the case in, for example, complex video and 
sound analysis. There arises the problem that for individual 
cases the selection performance may be bad although the 
performance is good on average. There is also the problem 
of what probability means, whether it is subjective, objective, 
or exists at all. 

To simplify matters, and because all discrete data can be 
binary coded, we consider only data samples that are finite 
binary strings. The basic idea is to found statistical theory 
on finite combinatorial principles independent of probabilistic 
assumptions, as the relation between the individual data and 
its explanation (model). We study extraction of meaningful
information in an initially limited setting where this information
is represented by a finite set (a model) of which the object
(the data sample) is a typical member. Using the theory of
Kolmogorov complexity, we can rigorously express and quan- 
tify typicality of individual objects. But typicality in itself is 
not necessarily a significant property: every object is typical in 
the singleton set containing only that object. More important is 
the following Kolmogorov complexity analog of probabilistic 
minimal sufficient statistic which implies typicality: The two- 
part description of the smallest finite set, together with the 
index of the object in that set, is as concise as the shortest 
one-part description of the object. The finite set models the 
regularity present in the object (since it is a typical element of 
the set). This approach has been generalized to computable 
probability mass functions. The combined theory has been 
developed in detail in [6] and called "Algorithmic Statistics." 

Here we study the most general form of algorithmic statistic: 
recursive function models. In this setting the issue of mean- 
ingful information versus accidental information is put in its 
starkest form; and in fact, has been around for a long time 
in various imprecise forms unconnected with the sufficient 
statistic approach: The issue has sparked the imagination and 
entered scientific popularization in [8] as "effective complexity"
(here "effective" is apparently used in the sense of "producing
an effect" rather than "constructive" as is customary in
the theory of computation). It is time that it receives formal 
treatment. Formally, we study the minimal length of a total 
recursive function that leads to an optimal length two-part 
code of the object being described. ("Total" means the function 
value is defined for all arguments in the domain, and "partial" 
means that the function is possibly not total.) This minimal 
length has been called the "sophistication" of the object in 
[14], [15] in a different, but related, setting of compression 
and prediction properties of infinite sequences. That treatment
is technically too vague to have any bearing on
the present work. We develop the notion based on prefix
Turing machines, rather than on a variety of monotonic Turing 
machines as in the cited papers. Below we describe related 
work in detail and summarize our results. Subsequently, we 
formulate our problem in the formal setting of computable 
two-part codes. 

A. Related Work 

A.N. Kolmogorov in 1974 [11] proposed an approach to a 
non-probabilistic statistics based on Kolmogorov complexity.
An essential feature of this approach is to separate the data 
into meaningful information (a model) and meaningless infor- 
mation (noise). Cover [2], [3] attached the name "sufficient 
statistic" to a model of which the data is a "typical" member. 
In Kolmogorov's initial setting the models are finite sets. As 
Kolmogorov himself pointed out, this is no real restriction: 
the finite sets model class is equivalent, up to a logarithmic 
additive term, to the model class of computable probability 
density functions, as studied in [19], [20], [23]. Related aspects 
of "randomness deficiency" were formulated in [12], [13] 
and studied in [19], [24]. Despite its evident epistemolog- 
ical prominence in the theory of hypothesis selection and 
prediction, only selected aspects of the theory were studied 
in these references. Recent work [6] can be considered as 
a comprehensive investigation into the sufficient statistic for 
finite set models and computable probability density function 
models. Here we extend the approach to the most general form: 
the model class of total recursive functions. This idea was 
pioneered by [14], [15] who, unaware of a statistic connec- 
tion, coined the cute word "sophistication." The algorithmic 
(minimal) sufficient statistic was related to an applied form
in [23], [7]: the well-known "minimum description length" 
principle [1] in statistics and inductive reasoning. 

In another paper [21] (chronologically following the present 
paper) we comprehensively treated all stochastic properties 
of the data in terms of Kolmogorov's so-called structure 
functions. The sufficient statistic aspect, studied here, covers 
only part of these properties. The results on the structure 
functions, including (non)computability properties, are valid, 
up to logarithmic additive terms, also for the model class of 
total recursive functions, as studied here. 

B. This Work: 

It will be helpful for the reader to be familiar with initial
parts of [6]. In [11], Kolmogorov observed that randomness of
an object in the sense of having high Kolmogorov complexity
is being random in just a "negative" sense. That being said,
we define the notion of sophistication (minimal sufficient
statistic in the total recursive function model class). It is
demonstrated to be meaningful (existence and nontriviality).
We then establish lower and upper bounds on the sophistication,
and we show that there are objects whose sophistication
achieves the upper bound. In fact, these are objects in which all
information is meaningful and there is (almost) no accidental 
information. That is, the simplest explanation of such an 
object is the object itself. In the simpler setting of finite 
set statistic the analogous objects were called "absolutely 
non-stochastic" by Kolmogorov. If such objects have high 
Kolmogorov complexity, then they can only be a random 
outcome of a "complex" random process, and Kolmogorov 
questioned whether such random objects, being random in 
just this "negative" sense, can occur in nature. But there are 
also objects that are random in the sense of having high 
Kolmogorov complexity, but simultaneously are typical
outcomes of "simple" random processes. These were therefore
said to be random in a "positive" sense [11]. Examples
are the strings of maximal Kolmogorov complexity; those are
very unsophisticated (with sophistication about 0), and are 
typical outcomes of tosses with a fair coin — a very simple 
random process. We subsequently establish the equivalence 
between sophistication and the algorithmic minimal sufficient 
statistics of the finite set class and the probability mass function
class. Finally, we investigate the algorithmic properties
of sophistication: nonrecursiveness, upper semicomputability,
and intercomputability relations among Kolmogorov complexity,
sophistication, and the halting sequence.

II. Preliminaries

A string is a finite binary sequence, an element of $\{0,1\}^*$.
If $x$ is a string then the length $l(x)$ denotes the number of
bits in $x$. We identify $\mathcal{N}$, the natural numbers, and $\{0,1\}^*$
according to the correspondence

$$(0, \epsilon), (1, 0), (2, 1), (3, 00), (4, 01), \ldots$$

Here $\epsilon$ denotes the empty word. Thus, $l(\epsilon) = 0$. The emphasis
is on binary sequences only for convenience; observations
in any alphabet can be so encoded in a way that is "theory
neutral." Below we will use the natural numbers and the strings
interchangeably.
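
The correspondence above can be made concrete with a short sketch. The function names `index_to_string` and `string_to_index` are illustrative choices, not notation from the text; the only assumption is the ordering shown in the display, shortest strings first and lexicographic within each length.

```python
def index_to_string(n: int) -> str:
    """Map a natural number to its string: 0 -> '', 1 -> '0', 2 -> '1', 3 -> '00', ..."""
    if n < 0:
        raise ValueError("n must be a natural number")
    # Writing n + 1 in binary and dropping the leading 1 yields the n-th
    # string in length-increasing lexicographic order.
    return bin(n + 1)[3:]


def string_to_index(s: str) -> int:
    """Inverse map: prepend a 1, read the result as binary, subtract 1."""
    return int("1" + s, 2) - 1


# The first five pairs of the correspondence.
pairs = [(n, index_to_string(n)) for n in range(5)]
```

Under this identification the length of the string with index n is the integer part of log(n + 1), which is why lengths and logarithms are used interchangeably up to rounding.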

A string $y$ is a proper prefix of a string $x$ if we can write
$x = yz$ for $z \neq \epsilon$. A set $\{x, y, \ldots\} \subseteq \{0,1\}^*$ is prefix-free if
for any pair of distinct elements in the set neither is a proper
prefix of the other. A prefix-free set is also called a prefix
code and its elements are called code words. An example of
a prefix code, that is useful later, encodes the source word
$x = x_1 x_2 \ldots x_n$ by the code word

$$\bar{x} = 1^n 0 x.$$

This prefix-free code is called self-delimiting, because there
is a fixed computer program associated with this code that can
determine where the code word $\bar{x}$ ends by reading it from
left to right without backing up. This way a composite code
message can be parsed into its constituent code words in one
pass by the computer program. (This desirable property holds
for every prefix-free encoding of a finite set of source words,
but not for every prefix-free encoding of an infinite set of
source words. For a single finite computer program to be
able to parse a code message the encoding needs to have a
certain uniformity property like the $\bar{x}$ code.) Since we use the
natural numbers and the strings interchangeably, $l(\bar{x})$ where $x$
is ostensibly an integer means the length in bits of the
self-delimiting code of the string with index $x$. On the other hand,
$\overline{l(x)}$ where $x$ is ostensibly a string means the self-delimiting
code of the string with index the length $l(x)$ of $x$. Using
this code we define the standard self-delimiting code for $x$
to be $x' = \overline{l(x)} x$. It is easy to check that $l(\bar{x}) = 2n + 1$ and
$l(x') = n + 2 \log n + 1$. Let $\langle \cdot \rangle$ denote a standard invertible
effective one-one encoding from $\mathcal{N} \times \mathcal{N}$ to a subset of $\mathcal{N}$.
For example, we can set $\langle x, y \rangle = x'y$ or $\langle x, y \rangle = \bar{x}y$. We can
iterate this process to define $\langle x, \langle y, z \rangle \rangle$, and so on.
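
As a minimal sketch of the two self-delimiting codes and the pairing built from them, the following Python functions encode and parse code words in a single left-to-right pass. All function names are illustrative assumptions; the length of the second code comes out as n + 2 floor(log(n + 1)) + 1 here, which agrees with n + 2 log n + 1 up to rounding.

```python
def index_to_string(n: int) -> str:
    """n-th binary string: 0 -> '', 1 -> '0', 2 -> '1', 3 -> '00', ..."""
    return bin(n + 1)[3:]


def string_to_index(s: str) -> int:
    return int("1" + s, 2) - 1


def bar(x: str) -> str:
    """The code 1^n 0 x with n = l(x); its length is 2n + 1."""
    return "1" * len(x) + "0" + x


def prime(x: str) -> str:
    """The code x' = bar(l(x)) x, where l(x) is read as a string."""
    return bar(index_to_string(len(x))) + x


def read_bar(stream: str, pos: int = 0):
    """Parse one bar-coded word starting at pos; return (word, next position)."""
    n = 0
    while stream[pos + n] == "1":
        n += 1
    start = pos + n + 1  # skip the 0 that terminates the unary length
    return stream[start:start + n], start + n


def read_prime(stream: str, pos: int = 0):
    """Parse one x'-coded word starting at pos; return (word, next position)."""
    length_word, pos = read_bar(stream, pos)
    n = string_to_index(length_word)
    return stream[pos:pos + n], pos + n


def pair(x: str, y: str) -> str:
    """One possible pairing: <x, y> = x' y."""
    return prime(x) + y


def unpair(z: str):
    x, pos = read_prime(z)
    return x, z[pos:]
```

For example, `bar("101")` gives `"1110101"` (7 = 2·3 + 1 bits), and `unpair(pair("0110", "111"))` recovers both components without any separator symbol.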

A. Kolmogorov Complexity 

For definitions, notation, and an introduction to Kolmogorov
complexity, see [16]. Informally, the Kolmogorov complexity,
or algorithmic entropy, $K(x)$ of a string $x$ is the length (number
of bits) of a shortest binary program (string) to compute $x$
on a fixed reference universal computer (such as a particular
universal Turing machine). Intuitively, $K(x)$ represents the
minimal amount of information required to generate $x$ by
any effective process. The conditional Kolmogorov complexity
$K(x \mid y)$ of $x$ relative to $y$ is defined similarly as the length
of a shortest program to compute $x$, if $y$ is furnished as
an auxiliary input to the computation. For technical reasons
we use a variant of complexity, so-called prefix complexity,
which is associated with Turing machines for which the set
of programs resulting in a halting computation is prefix-free.
We realize prefix complexity by considering a special type of
Turing machine with a one-way input tape, a separate work
tape, and a one-way output tape. Such Turing machines are
called prefix Turing machines. If a machine $T$ halts with output
$x$ after having scanned all of $p$ on the input tape, but not
further, then $T(p) = x$ and we call $p$ a program for $T$. It is
easy to see that $\{p : T(p) = x, x \in \{0,1\}^*\}$ is a prefix code.

Definition 2.1: A function $f$ from the natural numbers to
the natural numbers is partial recursive, or computable, if there
is a Turing machine $T$ that computes it: $f(x) = T(x)$ for
all $x$ for which either $f$ or $T$ (and hence both) are defined.
This definition can be extended to (multi-tuples of) rational
arguments and values.

Let $T_1, T_2, \ldots$ be a standard enumeration of all prefix Turing
machines with a binary input tape, for example the lexicographical
length-increasing ordered syntactic prefix Turing
machine descriptions [16], and let $\phi_1, \phi_2, \ldots$ be the enumeration
of corresponding functions that are computed by the
respective Turing machines ($T_i$ computes $\phi_i$). These functions
are the partial recursive functions of effectively prefix-free
encoded arguments. The Kolmogorov complexity of $x$ is
the length of the shortest binary program from which $x$ is
computed by such a function.



Definition 2.2: The prefix Kolmogorov complexity of $x$ is

$$K(x) = \min_{p,i} \{ l(\bar{i}) + l(p) : T_i(p) = x \}, \tag{II.1}$$

where the minimum is taken over $p \in \{0,1\}^*$ and $i \in
\{1, 2, \ldots\}$. For the development of the theory we actually
require the Turing machines to use auxiliary (also called
conditional) information, by equipping the machine with a
special read-only auxiliary tape containing this information at
the outset. Then, the conditional version $K(x \mid y)$ of the prefix
Kolmogorov complexity of $x$ given $y$ (as auxiliary information)
is defined similarly as before, and the unconditional
version is set to $K(x) = K(x \mid \epsilon)$.

Notation 2.3: From now on, we will denote by $\stackrel{+}{<}$ an
inequality to within an additive constant, and by $\stackrel{+}{=}$ the situation
when both $\stackrel{+}{<}$ and $\stackrel{+}{>}$ hold.

B. Two-Part Codes 

Let $T_1, T_2, \ldots$ be the standard enumeration of Turing machines,
and let $U$ be a standard universal Turing machine
satisfying $U(\langle i, p \rangle) = T_i(p)$ for all indices $i$ and programs $p$.
We fix $U$ once and for all and call it the reference universal
prefix Turing machine. The shortest program to compute $x$
by $U$ is denoted as $x^*$ (if there is more than one of them,
then $x^*$ is the first one in standard enumeration). It is a deep
and useful fact that the shortest effective description of an
object $x$ can be expressed in terms of a two-part code: the
first part describing an appropriate Turing machine and the
second part describing the program that, interpreted by the
Turing machine, reconstructs $x$. The essence of the theory is the
Invariance Theorem, which can be informally stated as follows:
For convenience, in the sequel we simplify notation and write
$U(x, y)$ for $U(\langle x, y \rangle)$. Rewrite

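$$\begin{aligned}
K(x) &= \min_{p,i} \{ l(\bar{i}) + l(p) : T_i(p) = x \} \\
&= \min_{p,i} \{ 2l(i) + l(p) + 1 : T_i(p) = x \} \\
&\le \min_{q} \{ l(q) : U(\epsilon, q) = x \} + 2l(u) + 1 \\
&\le \min_{r,j} \{ l(j^*) + l(r) : U(\epsilon, j^* \alpha r) = T_j(r) = x \} + 2l(u) + 1 \\
&\stackrel{+}{<} K(x).
\end{aligned}$$

Here the minima are taken over $p, q, r \in \{0,1\}^*$ and $i, j \in
\{1, 2, \ldots\}$. The last equalities are obtained by using the
universality of $U = T_u$ with $l(u) \stackrel{+}{=} 0$. As a consequence,

$$K(x) \stackrel{+}{=} \min_{r,j} \{ l(j^*) + l(r) : U(\epsilon, j^* \alpha r) = T_j(r) = x \},$$

$$K(x) \stackrel{+}{=} K_U(x) = \min_{q} \{ l(q) : U(\epsilon, q) = x \}.$$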

Thus, $K(x)$ and $K_U(x)$ differ by at most an additive constant
depending on the choice of $U$. It is standard to use

$$K(x) = K_U(x) \tag{II.2}$$

instead of (II.1) as the definition of prefix Kolmogorov complexity
[16]. However, we highlighted definition (II.1) to bring
out the two-part code nature. By universal logical principles,
the resulting theory is recursively invariant under adopting
either definition (II.1) or definition (II.2), as long as we stick
to one choice. If $T$ stands for a literal description of the prefix
Turing machine $T$ in standard format, for example the index
$j$ when $T = T_j$, then we can write $K(T) = K(j)$. The
string $j^*$ is a shortest self-delimiting program of $K(j)$ bits
from which $U$ can compute $j$, and subsequent execution of
the next self-delimiting fixed program $\alpha$ will compute $\bar{j}$ from
$j$. Altogether, this has the effect that $U(\epsilon, j^* \alpha r) = T_j(r)$. If
$(j_0, r_0)$ minimizes the expression above, then $T_{j_0}(r_0) = x$,
and hence $K(x \mid j_0) \stackrel{+}{<} l(r_0)$, and $K(j_0) + l(r_0) \stackrel{+}{=} K(x)$.
It is straightforward that $K(j_0) + K(x \mid j_0) \stackrel{+}{>} K(x, j_0) \stackrel{+}{>}
K(x)$, and therefore we have $l(r_0) \stackrel{+}{<} K(x \mid j_0)$. Altogether,
$l(r_0) \stackrel{+}{=} K(x \mid j_0)$. Replacing the minimizing $j = j_0$ by the
minimizing $T = T_{j_0}$ and $l(r_0)$ by $K(x \mid T)$, we can rewrite the
last displayed equation as

$$K(x) \stackrel{+}{=} \min_{T} \{ K(T) + K(x \mid T) : T \in \{T_0, T_1, \ldots\} \}. \tag{II.3}$$
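
The minimization over machine index and program can be illustrated on a toy scale. The sketch below is only an analogy under strong assumptions: the three "machines" are hypothetical Python functions, they are neither universal nor prefix machines, and the brute-force search is feasible only for very short strings. True Kolmogorov complexity is not computable, so nothing here computes $K(x)$.

```python
from itertools import product


def index_to_string(n: int) -> str:
    """n-th binary string: 0 -> '', 1 -> '0', 2 -> '1', 3 -> '00', ..."""
    return bin(n + 1)[3:]


def string_to_index(s: str) -> int:
    return int("1" + s, 2) - 1


def bar_length(i: int) -> int:
    """Length of the self-delimiting code of index i: 2 l(i) + 1."""
    return 2 * len(index_to_string(i)) + 1


# Hypothetical toy "machines"; NOT prefix machines and NOT universal.
MACHINES = {
    1: lambda p: p,                          # copy the program to the output
    2: lambda p: "10" * string_to_index(p),  # repeat "10" as often as p says
    3: lambda p: "1" * string_to_index(p),   # a run of ones
}


def all_strings(max_len: int):
    """Yield all binary strings of length <= max_len, shortest first."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)


def toy_two_part_cost(x: str):
    """Minimize l(bar(i)) + l(p) over the toy machines with T_i(p) = x."""
    best = None
    for i, machine in MACHINES.items():
        for p in all_strings(len(x)):
            if machine(p) == x:
                cost = bar_length(i) + len(p)
                if best is None or cost < best[0]:
                    best = (cost, i, p)
                break  # programs come shortest first, so p is optimal for T_i
    return best  # (total cost, machine index, program)


regular = toy_two_part_cost("10" * 6)
irregular = toy_two_part_cost("110100100111")
```

For the regular 12-bit string the repeating machine wins with a 2-bit program, while for the irregular string only the copying machine applies and the program is the string itself. This mirrors, in miniature, the trade-off between model cost and data-to-model cost in (II.3).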

C. Meaningful Information 

Expression (II.3) emphasizes the two-part code nature of
Kolmogorov complexity; using the regular aspects of x to
maximally compress. Suppose we consider an ongoing time-series
0101 . . . and we randomly stop gathering data after
having obtained the initial segment

x = 10101010101010101010101010.

We can encode this x by a small Turing machine representing
"the repeating pattern is 01," and which computes x, for example,
from the program "13." Intuitively, the Turing machine
part of the code squeezes out the regularities in x. What is 
left are irregularities, or random aspects of x relative to that 
Turing machine. The minimal-length two-part code squeezes 
out regularity only insofar as the reduction in the length of the 
description of random aspects is greater than the increase in the 
regularity description. In this setup the number of repetitions 
of the significant pattern is viewed as the random part of the 
data. 
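
A minimal sketch of this division, assuming a model class that consists only of "repeat a fixed pattern" machines: the pattern plays the role of the regular part and the repetition count plays the role of the program. The function names are illustrative, and the segment is used exactly as displayed above, so the pattern recovered is the two-bit block it starts with.

```python
def encode_periodic(x: str):
    """Return (pattern, repetitions) for the shortest pattern that tiles x exactly."""
    n = len(x)
    for period in range(1, n + 1):
        if n % period == 0 and x[:period] * (n // period) == x:
            return x[:period], n // period
    return None  # only reached for the empty string


def decode_periodic(pattern: str, repetitions: int) -> str:
    """The 'machine' is the pattern; the 'program' is the repetition count."""
    return pattern * repetitions


segment = "10101010101010101010101010"
model, program = encode_periodic(segment)
```

For a string without a shorter repeating block the only fitting pattern is the string itself with repetition count 1, so all the information moves into the model part and nothing is gained by the split.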

This interpretation of K(x) as the shortest length of a
two-part code for x, one part describing a Turing machine, 
or model, for the regular aspects of x and the second part 
describing the irregular aspects of x in the form of a program 
to be interpreted by T, has profound applications. 

The "right model" is a Turing machine T among the ones 
that halt for all inputs, a restriction that is justified later, 
and reach the minimum description length in (II.3). This T
embodies the amount of useful information contained in x. It 
remains to decide which such T to select among the ones that 
satisfy the requirement. Following Occam's Razor we opt here 
for the shortest one — a formal justification for this choice is 
given in [23]. The main problem with our approach is how to 
properly define a shortest program x* for x that divides into 
parts x* = pq such that p represents an appropriate T.

D. Symmetry of Information 

The following central notions are used in this paper. The
information in $x$ about $y$ is $I(x : y) = K(y) - K(y \mid x^*)$. By
the symmetry of information, a deep result of [5],

$$K(x, y) \stackrel{+}{=} K(x) + K(y \mid x^*) \stackrel{+}{=} K(y) + K(x \mid y^*). \tag{II.4}$$

Rewriting according to symmetry of information we see that
$I(x : y) \stackrel{+}{=} I(y : x)$ and therefore we call the quantity $I(x : y)$
the mutual information between $x$ and $y$.
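
The rewriting step can be spelled out as follows. This is a sketch of the standard argument using only (II.4); every equality marked with a plus holds up to an additive constant.

$$\begin{aligned}
I(x : y) &= K(y) - K(y \mid x^*) \\
&\stackrel{+}{=} K(x) + K(y) - K(x, y) \\
&\stackrel{+}{=} K(x) - K(x \mid y^*) = I(y : x).
\end{aligned}$$

The second line substitutes $K(y \mid x^*) \stackrel{+}{=} K(x, y) - K(x)$ and the third substitutes $K(x, y) \stackrel{+}{=} K(y) + K(x \mid y^*)$, both read off from (II.4).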

III. Model Classes 

Instead of the model class of finite sets, or computable 
probability density functions, as in [6], in this work we focus 
on the most general form of algorithmic model class: total 
recursive functions. We define the different model classes and 
summarize the central notions of "randomness deficiency" and 
"typicality" for the canonical finite set models to obtain points 
of reference for the related notions in the more general model 
classes. 

A. Set Models 

The model class of finite sets consists of the set of finite subsets
$S \subseteq \{0,1\}^*$. The complexity of the finite set $S$ is $K(S)$,
the length (number of bits) of the shortest binary program $p$
from which the reference universal prefix machine $U$ computes
a listing of the elements of $S$ and then halts. That is, if $S =
\{x_1, \ldots, x_n\}$, then $U(p) = \langle x_1, \langle x_2, \ldots, \langle x_{n-1}, x_n \rangle \ldots \rangle \rangle$.
The conditional complexity $K(x \mid S)$ of $x$ given $S$ is the
length (number of bits) of the shortest binary program $p$
from which the reference universal prefix machine $U$, given $S$
literally as auxiliary information, computes $x$. For every finite
set $S \subseteq \{0,1\}^*$ containing $x$ we have

$$K(x \mid S) \stackrel{+}{<} \log |S|. \tag{III.1}$$

Indeed, consider the self-delimiting code of $x$ consisting of its
$\lceil \log |S| \rceil$ bit long index of $x$ in the lexicographical ordering of
$S$. This code is called data-to-model code. Its length quantifies
the maximal "typicality," or "randomness," data (possibly
different from $x$) can have with respect to this model. The
lack of typicality of $x$ with respect to $S$ is measured by the
amount by which $K(x \mid S)$ falls short of the length of the
data-to-model code, the randomness deficiency of $x$ in $S$, defined
by

$$\delta(x \mid S) = \log |S| - K(x \mid S), \tag{III.2}$$

for $x \in S$, and $\infty$ otherwise. Data $x$ is typical with respect
to a finite set $S$ if the randomness deficiency is small. If
the randomness deficiency is close to 0, then there are no
simple special properties that single it out from the majority of
elements in $S$. This is not just terminology. Let $S \subseteq \{0,1\}^n$.
According to common viewpoints in probability theory, each
property represented by $S$ defines a large subset of $S$ consisting
of elements having that property, and, conversely, each
large subset of $S$ represents a property. For probabilistic
ensembles we take high probability subsets as properties; the
present case is uniform probability with finite support. For
some appropriate fixed constant $c$, let us identify a property
represented by $S$ with a subset $S'$ of $S$ of cardinality $|S'| >
(1 - 1/c)|S|$. If $\delta(x \mid S)$ is close to 0, then $x$ satisfies (that
is, is an element of) all properties (that is, sets) $S' \subseteq S$ of
low Kolmogorov complexity $K(S') = O(\log n)$. The precise
statements and quantifications are given in [16], [21], and we
do not repeat them here.
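
The data-to-model code used to justify (III.1) can be sketched directly: given the set, an element is identified by its fixed-width index in the sorted listing. The function names and the example set are illustrative assumptions; Python's default string ordering stands in for the lexicographical ordering of the set.

```python
def code_width(S: set) -> int:
    """ceil(log |S|) bits, computed exactly with integer arithmetic."""
    return (len(S) - 1).bit_length()


def data_to_model_code(x: str, S: set) -> str:
    """Fixed-width binary index of x in the sorted listing of S."""
    ordered = sorted(S)
    index = ordered.index(x)  # raises ValueError if x is not in S
    width = code_width(S)
    return format(index, "b").zfill(width) if width > 0 else ""


def decode_from_model(code: str, S: set) -> str:
    """Recover x from its index, given S as auxiliary information."""
    ordered = sorted(S)
    return ordered[int(code, 2) if code else 0]


# Hypothetical model: the 3-bit strings with an even number of ones.
even_parity = {"000", "011", "101", "110"}
code = data_to_model_code("101", even_parity)
```

Here every element of the four-element set costs exactly two bits given the set. The randomness deficiency (III.2) measures how far the true conditional complexity of a particular element falls below that code length, which this sketch does not and cannot compute.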

B. Probability Models 

The model class of computable probability density functions
consists of the set of functions $P : \{0,1\}^* \to [0,1]$ with
$\sum P(x) = 1$. "Computable" means here that there is a
Turing machine $T_P$ that, given $x$ and a positive rational $\epsilon$,
computes $P(x)$ with precision $\epsilon$. The (prefix-) complexity
$K(P)$ of a computable (possibly partial) function $P$ is defined
by $K(P) = \min_i \{ K(i) : \text{Turing machine } T_i \text{ computes } P \}$.

C. Function Models 

The model class of total recursive functions consists of the
set of functions $f : \{0,1\}^* \to \{0,1\}^*$ such that there is a
Turing machine $T$ such that $T(i) < \infty$ and $f(i) = T(i)$,
for every $i \in \{0,1\}^*$. The (prefix-) complexity $K(f)$ of a
total recursive function $f$ is defined by $K(f) = \min_i \{ K(i) :
\text{Turing machine } T_i \text{ computes } f \}$. If $f^*$ is a shortest program
for computing the function $f$ (if there is more than one of them
then $f^*$ is the first one in enumeration order), then $K(f) =
l(f^*)$.

Remark 3.1: In the definitions of $K(P)$ and $K(f)$, the
objects being described are functions rather than finite binary
strings. To unify the approaches, we can consider a finite
binary string $x$ as corresponding to a function having value
$x$ for argument 0. Note that we can upper semi-compute $x^*$
given $x$, but we cannot upper semi-compute $P^*$ given $P$ (as
an oracle), or $f^*$ given $f$ (again given as an oracle), since
we should be able to verify agreement of a program for a
function and an oracle for the target function, on all infinitely
many arguments.

IV. Typicality 

To explain typicality for general model classes, it is convenient
to use the distortion-rate [17], [18] approach for
individual data recently introduced in [9], [22]. Modeling the
data can be viewed as encoding the data by a model: the data
are source words to be coded, and models are code words for
the data. As before, the set of possible data is $\mathcal{D} = \{0,1\}^*$.
Let $\mathcal{R}^+$ denote the set of non-negative real numbers. For every
model class $\mathcal{M}$ (particular set of code words) we choose an
appropriate recursive function $d : \mathcal{D} \times \mathcal{M} \to \mathcal{R}^+$ defining the
distortion $d(x, M)$ between data $x$ and model $M$.

Remark 4.1: The choice of distortion function is a selection 
of which aspects of the data are relevant, or meaningful, and 
which aspects are irrelevant (noise). We can think of the distor- 
tion as measuring how far the model falls short in representing 
the data. Distortion-rate theory underpins the practice of lossy 
compression. For example, lossy compression of a sound file 
gives as "model" the compressed file where, among others, 
the very high and very low inaudible frequencies have been 
suppressed. Thus, the distortion function will penalize the
deletion of the inaudible frequencies only lightly, because they
are not relevant for the auditory experience.



Example 4.2: Let us look at various model classes and
distortion measures:

(i) The set of models are the finite sets of finite binary
strings. Let $S \subseteq \{0,1\}^*$ and $|S| < \infty$. We define $d(x, S) =
\log |S|$ if $x \in S$, and $\infty$ otherwise.

(ii) The set of models are the computable probability density
functions $P$ mapping $\{0,1\}^*$ to $[0,1]$. We define $d(x, P) =
\log 1/P(x)$ if $P(x) > 0$, and $\infty$ otherwise.

(iii) The set of models are the total recursive functions
$f$ mapping $\{0,1\}^*$ to $\mathcal{N}$. We define $d(x, f) = \min \{ l(d) :
f(d) = x \}$, and $\infty$ if no such $d$ exists.

If $\mathcal{M}$ is a model class, then we consider distortion balls of
given radius $r$ centered on $M \in \mathcal{M}$:

$$B_M(r) = \{ y : d(y, M) \le r \}.$$

This way, every model class and distortion measure can be
treated similarly to the canonical finite set case, which, however,
is especially simple in that the radius is not variable. That
is, there is only one distortion ball centered on a given finite
set, namely the one with radius equal to the log-cardinality of
that finite set. In fact, that distortion ball equals the finite set
on which it is centered.
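
The three distortion measures and the resulting distortion balls can be sketched over a small finite universe of strings. Everything concrete here is an assumption made for illustration: the function names, the example set, the example distribution, the doubling function, and in particular the search bound in `d_func`, which replaces the unbounded search that the definition in (iii) implies.

```python
from itertools import product
from math import inf, log2


def strings_up_to(max_len: int):
    """All binary strings of length <= max_len, shortest first."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)


def d_set(x: str, S: set) -> float:
    """(i) d(x, S) = log|S| if x is in S, infinity otherwise."""
    return log2(len(S)) if x in S else inf


def d_prob(x: str, P: dict) -> float:
    """(ii) d(x, P) = log 1/P(x) if P(x) > 0, infinity otherwise."""
    p = P.get(x, 0.0)
    return log2(1.0 / p) if p > 0 else inf


def d_func(x: str, f, search_bound: int) -> float:
    """(iii) d(x, f) = min{l(d) : f(d) = x}, searched only up to search_bound."""
    for d in strings_up_to(search_bound):
        if f(d) == x:
            return len(d)
    return inf


def ball(universe, distortion, r: float) -> set:
    """B_M(r) = {y : d(y, M) <= r}, restricted to a finite universe."""
    return {y for y in universe if distortion(y) <= r}


universe = list(strings_up_to(3))
S = {"00", "11"}
P = {"0": 0.5, "10": 0.25, "11": 0.25}


def double(d: str) -> str:
    """Hypothetical total function used as a model: write the input twice."""
    return d + d
```

With these definitions, `ball(universe, lambda y: d_set(y, S), 1.0)` returns `S` itself and any smaller radius gives the empty set, which is the fixed-radius behaviour of finite set models. For the distribution `P`, the ball grows from `{"0"}` at radius 1 to the whole support at radius 2, so the radius is a genuine parameter.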

Let $\mathcal{M}$ be a model class and $d$ a distortion measure. Since
in our definition the distortion is recursive, given a model $M \in
\mathcal{M}$ and radius $r$, the elements in the distortion ball of radius
$r$ can be recursively enumerated from the distortion function.
Giving the index of any element $x$ in that enumeration
we can find the element. Hence, $K(x \mid M, r) \stackrel{+}{<} \log |B_M(r)|$.
On the other hand, the vast majority of elements $y$ in the
distortion ball have complexity $K(y \mid M, r) \stackrel{+}{>} \log |B_M(r)|$
since, for every constant $c$, there are only $2^{\log |B_M(r)| - c} - 1$
binary programs of length $< \log |B_M(r)| - c$ available, and
there are $|B_M(r)|$ elements to be described. We can now
reason as in the similar case of finite set models. With data
$x$ and $r = d(x, M)$, if $K(x \mid M, d(x, M)) \stackrel{+}{>} \log |B_M(d(x, M))|$,
then $x$ belongs to every large majority of elements (has the
property represented by that majority) of the distortion ball
$B_M(d(x, M))$, provided that property is simple in the sense
of having a description of low Kolmogorov complexity.
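
The counting step in this argument is elementary and can be checked numerically. The sketch below assumes, for simplicity, that the log-cardinality of the ball is an integer that is at least c; the function names are illustrative.

```python
from itertools import product


def count_programs_shorter_than(m: int) -> int:
    """Count binary strings of length < m by explicit enumeration."""
    return sum(1 for n in range(m) for _ in product("01", repeat=n))


def max_fraction_compressible(log_ball_size: int, c: int) -> float:
    """Upper bound on the fraction of ball elements with a program
    shorter than log|B| - c: (2^(log|B| - c) - 1) / |B|."""
    ball_size = 2 ** log_ball_size
    programs = 2 ** (log_ball_size - c) - 1
    return programs / ball_size


short_programs = count_programs_shorter_than(5)
fraction = max_fraction_compressible(10, 3)
```

There are 31 = 2^5 - 1 strings shorter than 5 bits, and for a ball of 1024 elements at most a fraction 127/1024 of them, less than 2^-3, can have a description more than 3 bits below the log-cardinality. Each extra bit of compression halves the fraction of elements that can achieve it.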

Definition 4.3: The randomness deficiency of $x$ with respect
to model $M$ under distortion $d$ is defined as

$$\delta(x \mid M) = \log |B_M(d(x, M))| - K(x \mid M, d(x, M)).$$

Data $x$ is typical for model $M \in \mathcal{M}$ (and that model "typical"
or "best fitting" for $x$) if

$$\delta(x \mid M) \stackrel{+}{=} 0. \tag{IV.1}$$

If x is typical for a model M, then the shortest way to 
effectively describe x, given M, takes about as many bits as 
the descriptions of the great majority of elements in a recursive 
enumeration of the distortion ball. So there are no special 
simple properties that distinguish x from the great majority of 
elements in the distortion ball: they are all typical or random 
elements in the distortion ball (that is, with respect to the 
contemplated model). 

Example 4.4: Continuing Example 4.2 by applying (IV.1)
to different model classes:

(i) Finite sets: For finite set models $S$, clearly $K(x \mid S) \stackrel{+}{<}
\log |S|$. Together with (IV.1) we have that $x$ is typical for $S$,
and $S$ best fits $x$, if the randomness deficiency according to
(III.2) satisfies $\delta(x \mid S) \stackrel{+}{=} 0$.

(ii) Computable probability density functions: Instead of
the data-to-model code length $\log |S|$ for finite set models,
we consider the data-to-model code length $\log 1/P(x)$ (the
Shannon-Fano code). The value $\log 1/P(x)$ measures how
likely $x$ is under the hypothesis $P$. For probability models $P$,
define the conditional complexity $K(x \mid P, \lceil \log 1/P(x) \rceil)$ as
follows. Say that a function $A$ approximates $P$ if $|A(y, \epsilon) -
P(y)| < \epsilon$ for every $y$ and every positive rational $\epsilon$. Then
$K(x \mid P, \lceil \log 1/P(x) \rceil)$ is defined as the minimum length
of a program that, given $\lceil \log 1/P(x) \rceil$ and any function $A$
approximating $P$ as an oracle, prints $x$.

Clearly $K(x \mid P, \lceil \log 1/P(x) \rceil) \stackrel{+}{<} \log 1/P(x)$. Together
with (IV.1), we have that $x$ is typical for $P$, and $P$ best
fits $x$, if $K(x \mid P, \lceil \log 1/P(x) \rceil) \stackrel{+}{>} \log |\{ y : \log 1/P(y) \le
\log 1/P(x) \}|$. The right-hand side set condition is the same
as $P(y) \ge P(x)$, and there can be only $\le 1/P(x)$ such
$y$, since otherwise the total probability exceeds 1. Therefore,
the requirement, and hence typicality, is implied by
$K(x \mid P, \lceil \log 1/P(x) \rceil) \stackrel{+}{>} \log 1/P(x)$. Define the randomness
deficiency by $\delta(x \mid P) = \log 1/P(x) - K(x \mid
P, \lceil \log 1/P(x) \rceil)$. Altogether, a string $x$ is typical for a distribution
$P$, or $P$ is the best fitting model for $x$, if $\delta(x \mid P) \stackrel{+}{=} 0$.

(iii) Total Recursive Functions: In place of $\log |S|$ for
finite set models we consider the data-to-model code length
(actually, the distortion $d(x, f)$ above)

$$l_x(f) = \min\{l(d) : f(d) = x\}.$$

Define the conditional complexity $K(x \mid f, l_x(f))$ as the
minimum length of a program that, given $l_x(f)$ and an oracle
for $f$, prints $x$.

Clearly, $K(x \mid f, l_x(f)) \stackrel{+}{<} l_x(f)$. Together with (IV.1),
we have that $x$ is typical for $f$, and $f$ best fits $x$, if
$K(x \mid f, l_x(f)) \stackrel{+}{>} \log |\{y : l_y(f) \le l_x(f)\}|$. There are at
most $2^{l_x(f)+1} - 1$ many $y$ satisfying the set condition,
since each such $y$ has a witness $d \in \{0,1\}^*$ with $f(d) = y$ and
$l(d) \le l_x(f)$. Therefore, the requirement, and hence
typicality, is implied by $K(x \mid f, l_x(f)) \stackrel{+}{>} l_x(f)$. Define the
randomness deficiency by $\delta(x \mid f) = l_x(f) - K(x \mid f, l_x(f))$.
Altogether, a string $x$ is typical for a total recursive function
$f$, and $f$ is the best fitting recursive function model for $x$ if
$\delta(x \mid f) \stackrel{+}{=} 0$, or written differently,

$$K(x \mid f, l_x(f)) \stackrel{+}{=} l_x(f). \qquad \text{(IV.2)}$$

Note that since $l_x(f)$ is given as conditional information,
with $l_x(f) = l(d)$ and $f(d) = x$, the quantity $K(x \mid f, l_x(f))$
represents the number of bits in a shortest self-delimiting
description of $d$.
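
The definition of $l_x(f)$ can be made concrete with a small search. The following is a minimal sketch, assuming a toy total function `toy_model` and a search cutoff `max_len`; both are hypothetical and chosen only for illustration. Because $f$ is total, every call `f(d)` returns, so enumerating the arguments in length-increasing order finds a shortest witness first.

```python
from itertools import count, product

def binary_strings():
    """Yield all binary strings in length-increasing lexicographic order."""
    for n in count(0):
        for bits in product("01", repeat=n):
            yield "".join(bits)

def toy_model(d):
    """Hypothetical total function f: output the input written twice."""
    return d + d

def code_length(f, x, max_len=12):
    """Return l_x(f) = min{l(d) : f(d) = x}, or None if no d of length <= max_len works."""
    for d in binary_strings():
        if len(d) > max_len:
            return None          # give up: no witness within the searched radius
        if f(d) == x:
            return len(d)        # the first hit is a shortest one, by enumeration order

print(code_length(toy_model, "0101"))   # witness d = "01"
print(code_length(toy_model, "011"))    # odd length, so toy_model never outputs it
```

The cutoff is only there to keep the sketch terminating on strings outside the range of the toy function; the definition itself has no such bound.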

Remark 4.5: We required $l_x(f)$ in the conditional in (IV.1).
This is the information about the radius of the distortion ball
centered on the model concerned. Note that in the canonical
finite set model case, as treated in [11], [6], [21], every model
has a fixed radius which is explicitly provided by the model
itself. But in the more general model classes of computable
probability density functions, or total recursive functions,
models can have a variable radius. There are subclasses of the
more general models that have fixed radii (like the finite
set models).

(i) In the computable probability density functions one can
think of the probabilities with a finite support, for example
$P_n(x) = 1/2^n$ for $l(x) = n$, and $P_n(x) = 0$ otherwise.

(ii) In the total recursive function case one can similarly
think of functions with finite support, for example
$f_n(x) = \sum_{i=1}^{n} x_i$ for $x = x_1 \ldots x_n$, and $f_n(x) = 0$
for $l(x) \ne n$.
The incorporation of the radius in the model will increase the
complexity of the model, and hence of the minimal sufficient
statistic below.

V. Sufficient Statistic

A statistic is a function mapping the data to an ele- 
ment (model) in the contemplated model class. With some 
sloppiness of terminology we often call the function value 
(the model) also a statistic of the data. The most important 
concept in this paper is the sufficient statistic. For an extensive 
discussion of this notion for specific model classes see [6], 
[21]. A statistic is called sufficient if the two-part description 
of the data by way of the model and the data-to-model code is 
as concise as the shortest one-part description of $x$. Consider
a model class $\mathcal{M}$.

Definition 5.1: A model $M \in \mathcal{M}$ is a sufficient statistic for
$x$ if

$$K(M, d(x,M)) + \log |B_M(d(x,M))| \stackrel{+}{=} K(x). \qquad \text{(V.1)}$$

Lemma 5.2: If $M$ is a sufficient statistic for $x$, then
$K(x \mid M, d(x,M)) \stackrel{+}{=} \log |B_M(d(x,M))|$, that is, $x$ is typical for $M$.

Proof: We can rewrite $K(x) \stackrel{+}{<} K(x, M, d(x,M)) \stackrel{+}{<}
K(M, d(x,M)) + K(x \mid M, d(x,M)) \stackrel{+}{<} K(M, d(x,M)) +
\log |B_M(d(x,M))| \stackrel{+}{=} K(x)$. The first three inequalities are
straightforward and the last equality is by the assumption of
sufficiency. Altogether, the first sum equals the second sum,
which implies the lemma. ■
Thus, if $M$ is a sufficient statistic for $x$, then $x$ is a typical
element for $M$, and $M$ is the best fitting model for $x$. Note that
the converse implication, "typicality" implies "sufficiency," is
not valid. Sufficiency is a special type of typicality, where the
model does not add significant information to the data, since
the preceding proof shows $K(x) \stackrel{+}{=} K(x, M, d(x,M))$. Using
the symmetry of information (III.4) this shows that

$$K(M, d(x,M) \mid x) \stackrel{+}{=} K(M \mid x) \stackrel{+}{=} 0. \qquad \text{(V.2)}$$

This means that: 

(i) A sufficient statistic $M$ is determined by the data in
the sense that we need only an $O(1)$-bit program, possibly
depending on the data itself, to compute the model from the
data.

(ii) For each model class and distortion there is a universal 
constant c such that for every data item x there are at most c 
sufficient statistics. 

Example 5.3: Finite sets: For the model class of finite sets,
a set $S$ is a sufficient statistic for data $x$ if

$$K(S) + \log |S| \stackrel{+}{=} K(x).$$

Computable probability density functions: For the model
class of computable probability density functions, a function
$P$ is a sufficient statistic for data $x$ if

$$K(P) + \log 1/P(x) \stackrel{+}{=} K(x).$$



Definition 5.4: For the model class of total recursive functions,
a function $f$ is a sufficient statistic for data $x$ if

$$K(x) \stackrel{+}{=} K(f) + l_x(f). \qquad \text{(V.3)}$$

Following the above discussion, the meaningful information
in $x$ is represented by $f$ (the model) in $K(f)$ bits, and the
meaningless information in $x$ is represented by $d$ (the noise
in the data) with $f(d) = x$ in $l(d) = l_x(f)$ bits. Note that
$l(d) \stackrel{+}{=} K(d) \stackrel{+}{=} K(d \mid f^*)$, since the two-part code $(f^*, d)$
for $x$ cannot be shorter than the shortest one-part code of
$K(x)$ bits, and therefore the $d$-part must already be maximally
compressed. By Lemma 5.2, $l_x(f) \stackrel{+}{=} K(x \mid f^*, l_x(f))$, $x$ is
typical for $f$, and hence $K(x) \stackrel{+}{=} K(f) + K(x \mid f^*, l_x(f))$.

VI. Minimal Sufficient Statistic 

Definition 6.1: Consider the model class of total recursive
functions. A minimal sufficient statistic for data $x$ is a sufficient
statistic (V.3) for $x$ of minimal prefix complexity. Its length is
known as the sophistication of $x$, and is defined by
$\mathrm{soph}(x) = \min\{K(f) : K(f) + l_x(f) \stackrel{+}{=} K(x)\}$.

Recall that the reference universal prefix Turing machine $U$
was chosen such that $U(T, d) = T(d)$ for all $T$ and $d$. Looking
at it slightly more from a programming point of view, we can
define a pair $(T, d)$ to be a description of a finite string $x$,
if $U(T, d)$ prints $x$ and $T$ is a Turing machine computing
a function $f$ so that $f(d) = x$. For the notion of minimal
sufficient statistic to be nontrivial, it should be impossible to
always shift information from $f$ to $d$: that is, given $f(d) = x$
and $K(f) + l_x(f) \stackrel{+}{=} K(x)$ with $K(f) \stackrel{+}{\neq} 0$, it should be
impossible to always write, for example, $f'(d') = x$ with
$K(f') + l_x(f') \stackrel{+}{=} K(x)$ and $K(f') \stackrel{+}{=} 0$. If the model class
contains a fixed universal model that can mimic all other models,
then we can always shift all model information to the
data-to-(universal) model code. Note that this problem does not
arise in common statistical model classes: these do not contain
universal models in the algorithmic sense. First we show that the
partial recursive function model class, because it contains a
universal element, does not allow a straightforward nontrivial
division into meaningful and meaningless information.

Lemma 6.2: Assume for the moment that we allow all
partial recursive programs as statistic. Then, the sophistication
of all data $x$ is $\stackrel{+}{=} 0$.

Proof: Let the index of $U$ (the reference universal
prefix Turing machine) in the standard enumeration $T_1, T_2, \ldots$
of prefix Turing machines be $u$. Let $T_f$ be a Turing machine
computing $f$. Suppose that $U(T_f, d) = x$. Then, also
$U(u, (T_f, d)) = U(T_f, d) = x$. Since $u$ is a fixed index, the
model $U$ has complexity $\stackrel{+}{=} 0$, and all information has moved
into the data-to-model code $(T_f, d)$. ■
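
The shift used in this proof can be illustrated with a toy interpreter. This is only a sketch under strong assumptions: the table `MODELS` and the function `universal` are hypothetical, and a finite lookup table is of course not a universal Turing machine. It shows only how the model can be moved into the data part so that one fixed function serves as the "model" for every string.

```python
# Hypothetical table of total "models"; names and functions are illustrative only.
MODELS = {
    "repeat": lambda d: d + d,
    "reverse": lambda d: d[::-1],
}

def universal(pair):
    """Toy stand-in for U: the argument carries both a model name and its data."""
    name, d = pair
    return MODELS[name](d)

# Two-part description of x with a nontrivial model: (model="repeat", data="01").
x = MODELS["repeat"]("01")

# The same x described with the single fixed model `universal`;
# the former model now sits inside the data part ("repeat", "01").
assert universal(("repeat", "01")) == x
```

In the second description the model part is the same for all data, which is the sense in which the model information becomes constant.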

Remark 6.3: This shows that unrestricted partial recursive 
statistics are uninteresting. Naively, this could leave the im- 
pression that the separation of the regular and the random 
part of the data is not as objective as the whole approach
lets us hope for. If we consider complexities of the minimal
sufficient statistics in model classes of increasing power:
finite sets, computable probability distributions, total recursive 
functions, partial recursive functions, then the complexities 
appear to become smaller all the time eventually reaching 
zero. It would seem that the universality of Kolmogorov 
complexity, based on the notion of partial recursive functions, 
would suggest a similar universal notion of sufficient statistic 
based on partial recursive functions. But in this case the very 
universality trivializes the resulting definition: because partial 
recursive functions contain a particular universal element that 
can simulate all the others, this implies that the universal 
partial recursive function is a universal model for all data, 
and the data-to-model code incorporates all information in 
the data. Thus, if a model class contains a universal model 
that can simulate all other models, then this model class is 
not suitable for defining two-part codes consisting of mean- 
ingful information and accidental information. It turns out 
that the key to nontrivial separation is the requirement that 
the program witnessing the sophistication be total. That the 
resulting separation is non-trivial is evidenced by the fact, 
shown below, that the amount of meaningful information in 
the data does not change by more than a logarithmic additive 
term under change of model classes among finite set models, 
computable probability models, and total recursive function 
models. That is, very different model classes all result in the 
same amount of meaningful information in the data, up to 
negligible differences. So if deterioration occurs in widening 
model classes it occurs all at once by having a universal 
element in the model class. ◊

Apart from triviality, a class of statistics can also possibly be 
vacuous by having the length of the minimal sufficient statistic 
exceed K{x). Our first task is to determine whether the 
definition is non-vacuous. We will distinguish sophistication 
in different description modes: 

Lemma 6.4 (Existence): For every finite binary string $x$, the
sophistication satisfies $\mathrm{soph}(x) \stackrel{+}{<} K(x)$.

Proof: By definition of the prefix complexity there is a
program $x^*$ of length $l(x^*) = K(x)$ such that $U(x^*, \epsilon) = x$.
This program $x^*$ can be partial. But we can define another
program $x^*_s = s x^*$ where $s$ is a program of a constant number
of bits that tells the following program to ignore its actual input
and compute as if its input were $\epsilon$. Clearly, $x^*_s$ is total and is
a sufficient statistic of the total recursive function type, that
is, $\mathrm{soph}(x) \le l(x^*_s) \stackrel{+}{<} l(x^*) = K(x)$. ■
The previous lemma gives an upper bound on the sophistication.
This still leaves the possibility that the sophistication is
always $\stackrel{+}{=} 0$, for example in the most liberal case of unrestricted
totality. But this turns out to be impossible.

Theorem 6.5: (i) For every $x$, if a sufficient statistic $f$
satisfies $K(l_x(f) \mid f^*) \stackrel{+}{=} 0$, then $K(f) \stackrel{+}{>} K(K(x))$ and
$l_x(f) \stackrel{+}{<} K(x) - K(K(x))$.

(ii) For $x$ as a variable running through a sequence of finite
binary strings of increasing length, we have

$$\liminf_{l(x) \to \infty} \mathrm{soph}(x) \stackrel{+}{=} 0. \qquad \text{(VI.1)}$$

(iii) For every $n$, there exists an $x$ of length $n$, such that
every sufficient statistic $f$ for $x$ that satisfies
$K(l_x(f) \mid f^*) \stackrel{+}{=} 0$ has $K(f) \stackrel{+}{>} n$.

(iv) For every $n$ there exists an $x$ of length $n$ such that
$\mathrm{soph}(x) \stackrel{+}{>} n - \log n - 2 \log \log n$.

Proof: (i) If $f$ is a sufficient statistic for $x$, then

$$K(x) \stackrel{+}{=} K(f) + K(d \mid f^*) \stackrel{+}{=} K(f) + l_x(f). \qquad \text{(VI.2)}$$

Since $K(l_x(f) \mid f^*) \stackrel{+}{=} 0$, given an $O(1)$ bit program $q$ we
can retrieve both $l_x(f)$ and $K(f) = l(f^*)$ from $f^*$.
Therefore, we can retrieve $K(x) \stackrel{+}{=} K(f) + l_x(f)$ from $qf^*$.
That shows that $K(K(x)) \stackrel{+}{<} K(f)$. This proves the first
statement, and the second statement follows by (VI.2).

(ii) An example of very unsophisticated strings are the
individually random strings with high complexity: $x$ of length
$l(x) = n$ with complexity $K(x) \stackrel{+}{=} n + K(n)$. Then, the
identity program $\iota$ with $\iota(d) = d$ for all $d$ is total, has
complexity $K(\iota) \stackrel{+}{=} 0$, and satisfies $K(x) \stackrel{+}{=} K(\iota) + l(x^*)$.
Hence, $\iota$ witnesses that $\mathrm{soph}(x) \stackrel{+}{=} 0$. This shows (VI.1).

(iii) Consider the set $S^m = \{y : K(y) \le m\}$. By [6] we
have $\log |S^m| \stackrel{+}{=} m - K(m)$. Let $m < n$. Since there are
$2^n$ strings of length $n$, there are strings of length $n$ not in
$S^m$. Let $x$ be any such string, and denote $k = K(x)$. Then,
by construction $k > m$ and by definition $k \stackrel{+}{<} n + K(n)$.
Let $f$ be a sufficient statistic for $x$. Then, $K(f) + l_x(f) \stackrel{+}{=} k$.
By assumption, there is an $O(1)$-bit program $q$ such that
$U(qf^*) = l_x(f)$. Let $d$ witness $l_x(f)$ by $f(d) = x$ with
$l(d) = l_x(f)$. Define the set $D = \{0,1\}^{l_x(f)}$. Clearly, $d \in D$.
Since $x$ can be retrieved from $f$ and the lexicographical index of
$d$ in $D$, and $\log |D| = l_x(f)$, we have $K(f) + \log |D| \stackrel{+}{=} k$.
Since we can obtain $D$ from $qf^*$ we have $K(D) \stackrel{+}{<} K(f)$. On the
other hand, since we can retrieve $x$ from $D$ and the index of
$d$ in $D$, we must have $K(D) + \log |D| \stackrel{+}{>} k$, which implies
$K(D) \stackrel{+}{>} K(f)$. Altogether, therefore, $K(D) \stackrel{+}{=} K(f)$.

We now show that we can choose $x$ so that $K(D) \stackrel{+}{>} n$, and
therefore $K(f) \stackrel{+}{>} n$. For every length $n$, there exists a $z$ of
complexity $K(z \mid n) \stackrel{+}{<} n$ such that a minimal sufficient finite
set statistic $S$ for $z$ has complexity at least $K(S \mid n) \stackrel{+}{>} n$, by
Theorem IV.2 of [6]. Since $\{z\}$ is trivially a sufficient statistic
for $z$, it follows $K(z \mid n) \stackrel{+}{=} K(S \mid n) \stackrel{+}{=} n$. This implies
$K(z), K(S) \stackrel{+}{>} n$. Therefore, we can choose $m = n - c_2$
for a large enough constant $c_2$ so as to ensure that $z \notin S^m$.
Consequently, we can choose $x$ above as such a $z$. Since every
finite set sufficient statistic for $x$ has complexity at least that
of a finite set minimal sufficient statistic for $x$, it follows that
$K(D) \stackrel{+}{>} n$. Therefore, $K(f) \stackrel{+}{>} n$, which was what we had to
prove.

(iv) In the proof of (i) we used $K(l_x(f) \mid f^*) \stackrel{+}{=} 0$. Without
using this assumption, the corresponding argument yields
$k \stackrel{+}{<} K(f) + K(l_x(f)) + \log |D|$. We also have
$K(f) + l_x(f) \stackrel{+}{<} k$ and $l(d) = \log |D|$. Since we can retrieve $x$
from $D$ and its index in $D$, the same argument as above shows
$|K(f) - K(D)| \stackrel{+}{<} K(l_x(f))$, and still following the argument above,
$K(f) \stackrel{+}{>} n - K(l_x(f))$. Since $l_x(f) \stackrel{+}{<} n$ we have
$K(l_x(f)) \stackrel{+}{<} \log n + 2 \log \log n$. This proves the statement. ■
The useful (V.2) states that there is a constant, such that
for every $x$ there are at most that constant many sufficient
statistics for $x$, and there is a constant length program (possibly
depending on $x$), that generates all of them from $x^*$. In fact,
there is a slightly stronger statement from which this follows:

Lemma 6.6: There is a universal constant $c$, such that for
every $x$, the number of $f^*d$ such that $f(d) = x$ and
$K(f) + l(d) \stackrel{+}{=} K(x)$, is bounded above by $c$.

Proof: Let the prefix Turing machine $T_f$ compute $f$.
Since $U(T_f, d) = x$ and $K(T_f) + l(d) \stackrel{+}{=} K(x)$, the combination
$f^*d$ (with self-delimiting $f^*$) is a shortest prefix program
for $x$. From [16], Exercise 3.3.7 item (b) on p. 205, it follows
that the number of shortest prefix programs is upper bounded
by a universal constant. ■

VII. Relation Between Sufficient Statistic for 
Different Model Classes 

Previous work studied sufficiency for finite set models, and 
computable probability mass functions models, [6]. The most 
general models that are still meaningful are total recursive 
functions as studied here. We show that there are correspond- 
ing, almost equivalent, sufficient statistics in all model classes. 

Lemma 7.1: (i) If $S$ is a sufficient statistic of $x$ (finite set
type), then there is a corresponding sufficient statistic $P$ of
$x$ (probability mass function type) such that $K(P) \stackrel{+}{=} K(S)$,
$\log 1/P(x) = \log |S|$, and $K(P \mid x^*) \stackrel{+}{=} 0$.

(ii) If $P$ is a sufficient statistic of $x$ of the computable total
probability density function type, then there is a corresponding
sufficient statistic $f$ of $x$ of the total recursive function type
such that $K(f) \stackrel{+}{=} K(P)$, $l_x(f) = \lceil \log 1/P(x) \rceil$, and
$K(f \mid x^*) \stackrel{+}{=} 0$.

Proof: (i) By assumption, $S$ is a finite set such that
$x \in S$ and $K(x) \stackrel{+}{=} K(S) + \log |S|$. Define the probability
distribution $P(y) = 1/|S|$ for $y \in S$ and $P(y) = 0$ otherwise.
Since $S$ is finite, $P$ is computable. Since $K(S) \stackrel{+}{=} K(P)$, and
$\log |S| \stackrel{+}{=} \lceil \log 1/P(x) \rceil$, we have
$K(x) \stackrel{+}{=} K(P) + \log 1/P(x)$.
Since $P$ is a computable probability mass function we have
$K(x \mid P^*) \stackrel{+}{<} \log 1/P(x)$, by the standard Shannon-Fano code
construction [3] that assigns a code word of length
$\lceil \log 1/P(x) \rceil$ to $x$. Since by (III.4) we have
$K(x) \stackrel{+}{<} K(x, P) \stackrel{+}{=} K(P) + K(x \mid P^*)$ it follows that
$\log 1/P(x) \stackrel{+}{<} K(x \mid P^*)$. Hence,
$K(x \mid P^*) \stackrel{+}{=} \log 1/P(x)$. Therefore, by (III.4),
$K(x, P) \stackrel{+}{=} K(x)$ and, by rewriting $K(x, P)$ in the other way
according to (III.4), $K(P \mid x^*) \stackrel{+}{=} 0$.

(ii) By assumption, $P$ is a computable probability density
function with $P(x) > 0$ and $K(x) \stackrel{+}{=} K(P) + \log 1/P(x)$.
The witness of this equality is a shortest program $P^*$ for $P$
and a code word $s_x$ for $x$ according to the standard Shannon-Fano
code, [3], with $l(s_x) \stackrel{+}{=} \log 1/P(x)$. Given $P$, we can
reconstruct $x$ from $s_x$ by a fixed standard algorithm. Define
the recursive function $f$ from $P$ such that $f(s_x) = x$. In fact,
from $P^*$ this only requires a constant length program $q$, so
that $T_f = qP^*$ is a program that computes $f$ in the sense that
$U(T_f, d) = f(d)$ for all $d$. Similarly, $P$ can be retrieved from
$f$. Hence, $K(f) \stackrel{+}{=} K(P)$ and $K(x) \stackrel{+}{=} K(f) + l(s_x)$. That
is, $f$ is a sufficient statistic for $x$. Also, $f$ is a total recursive
function. Since $f(s_x) = x$ we have $K(x \mid f^*) \stackrel{+}{<} l(s_x)$.
This shows that $K(x) \stackrel{+}{>} K(f) + K(x \mid f^*)$, and since $x$ can
by definition be reconstructed from $f^*$
and a program of length $K(x \mid f^*)$, it follows that equality
must hold. Consequently, $l(s_x) \stackrel{+}{=} K(x \mid f^*)$, and hence, by
(III.4), $K(x, f) \stackrel{+}{=} K(x)$ and $K(f \mid x^*) \stackrel{+}{=} 0$. ■
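
The two constructions in this proof can be traced on a small instance. The sketch below is an illustration under stated assumptions, not the construction of [3]: it uses Shannon's cumulative-probability code as a stand-in for the "standard Shannon-Fano code", and the helper names (`shannon_length`, `shannon_code`, `decoder_from`) and the `default` output for non-codewords are hypothetical. It builds the uniform $P$ from a finite set $S$ as in part (i), then a total decoder $f$ with $f(s_x) = x$ as in part (ii).

```python
from fractions import Fraction

def shannon_length(p):
    """Smallest integer L with 2**L >= 1/p, i.e. ceil(log2(1/p)), computed exactly."""
    L = 0
    while Fraction(2) ** L * p < 1:
        L += 1
    return L

def shannon_code(P):
    """Give each x with P[x] > 0 the first ceil(log 1/P[x]) bits of the cumulative probability."""
    ordered = sorted((x for x in P if P[x] > 0), key=lambda x: (-P[x], x))
    code, cumulative = {}, Fraction(0)
    for x in ordered:
        L = shannon_length(P[x])
        index = int(cumulative * 2 ** L)   # floor, since cumulative >= 0
        code[x] = format(index, "b").zfill(L) if L > 0 else ""
        cumulative += P[x]
    return code

def decoder_from(P, default=""):
    """Total function f with f(s_x) = x; every non-codeword is mapped to `default`."""
    table = {s: x for x, s in shannon_code(P).items()}
    return lambda d: table.get(d, default)

# Part (i): a finite set S and the uniform distribution on it.
S = {"000", "011", "101", "110"}
P = {y: Fraction(1, len(S)) for y in S}

# Part (ii): the decoder f built from P; each code word has length log|S| = 2.
code = shannon_code(P)
f = decoder_from(P)
```

Mapping all non-codewords to a fixed default is one simple way to make the decoder total; the proof only needs that $f$ is total and that $f(s_x) = x$.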

We have now shown that a sufficient statistic in a less 
general model class corresponds directly to a sufficient statistic 
in the next more general model class. We now show that, 
with a negligible error term, a sufficient statistic in the most 
general model class of total recursive functions has a directly 
corresponding sufficient statistic in the least general finite set 
model class. That is, up to negligible error terms, a sufficient 
statistic in any of the model classes has a direct representative 
in any of the other model classes. 

Lemma 7.2: Let $x$ be a string of length $n$, and $f$ be a total
recursive function sufficient statistic for $x$. Then, there is a
finite set $S \ni x$ such that $K(S) + \log |S| \stackrel{+}{=} K(x) + O(\log n)$.

Proof: By assumption there is an $O(1)$-bit program $q$
such that $U(qf^*) = l_x(f)$. For each $y \in \{0,1\}^*$, let
$i_y = \min\{i : f(i) = y\}$. Define
$S = \{y : f(i_y) = y, l(i_y) \le l_x(f)\}$. We can compute $S$ by
computation of $f(i)$, on all arguments $i$ of at most $l_x(f)$ bits,
since by assumption $f$ is total. This shows
$K(S) \stackrel{+}{<} K(f, l_x(f)) \stackrel{+}{<} K(f) + K(l_x(f))$.
Since $l_x(f) \stackrel{+}{<} K(x)$, we have $l_x(f) \stackrel{+}{<} l(x) = n$. Moreover,
$\log |S| \stackrel{+}{<} l_x(f)$, since $S$ has at most $2^{l_x(f)+1} - 1$ elements.
Since $x \in S$, $K(x) \stackrel{+}{<} K(S) + \log |S| \stackrel{+}{<}
K(x) + O(\log n)$, where we use the sufficiency of $f$ to obtain
the last inequality. ■
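
The set $S$ in this proof is obtained by running $f$ on every argument of at most $l_x(f)$ bits. A minimal sketch, assuming a hypothetical non-injective total function `strip_trailing_zeros` as the model, shows that $x \in S$ and that $|S|$ stays within the bound $2^{l_x(f)+1} - 1$.

```python
from itertools import product

def strings_up_to(r):
    """Yield every binary string of length at most r."""
    for n in range(r + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)

def strip_trailing_zeros(d):
    """Hypothetical total function f: drop all trailing zeros of the argument."""
    return d.rstrip("0")

def finite_set_model(f, radius):
    """S = {f(i) : l(i) <= radius}, which equals {y : l_y(f) <= radius}."""
    return {f(i) for i in strings_up_to(radius)}

x = "011"
radius = 3   # l_x(f) here: the shortest d with f(d) = "011" is "011" itself
S = finite_set_model(strip_trailing_zeros, radius)
```

Because this toy function is many-to-one, $S$ is smaller than the number of arguments tried, which is why only an upper bound on $\log |S|$ in terms of $l_x(f)$ is available in general.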

VIII. Algorithmic Properties 

We investigate the recursion properties of the sophistication
function. In [5], Gács gave an important and deep result
(VIII.1) below, that quantifies the uncomputability of $K(x)$
(the bare uncomputability can be established in a much simpler
fashion). For every length $n$ there is an $x$ of length $n$ such
that:

$$\log n - \log \log n \stackrel{+}{<} K(K(x) \mid x) \stackrel{+}{<} \log n. \qquad \text{(VIII.1)}$$

Note that the right-hand side holds for every $x$ by the simple
argument that $K(x) \le n + 2 \log n$ and hence
$K(K(x)) \stackrel{+}{<} \log n$. But there are $x$'s such that the length of the
shortest program to compute $K(x)$ almost reaches this upper bound,
even if the full information about $x$ is provided. It is natural to
suppose that the sophistication function is not recursive either.
The following lemmas suggest that the complexity function
is more uncomputable than the sophistication.

Theorem 8.1: The function soph is not recursive. 

Proof: Given $n$, let $x_0$ be the least $x$ such that
$\mathrm{soph}(x) > n - 2 \log n$. By Theorem 6.5 we know that there exist $x$
such that $\mathrm{soph}(x) \to \infty$ for $x \to \infty$, hence $x_0$ exists.
Assume by way of contradiction that the sophistication function is
computable. Then, we can find $x_0$, given $n$, by simply computing
the successive values of the function. But then
$K(x_0) \stackrel{+}{<} K(n)$, while by Lemma 6.4
$K(x_0) \stackrel{+}{>} \mathrm{soph}(x_0)$ and by assumption
$\mathrm{soph}(x_0) > n - 2 \log n$, which is impossible. ■
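
The contradiction can be written out as one chain of inequalities. The last step uses the bound $K(n) \stackrel{+}{<} \log n + 2 \log \log n$, the same estimate applied to $K(l_x(f))$ in the proof of Theorem 6.5 (iv); its use at this point is an assumption of this worked step, since the proof above does not spell it out.

```latex
\begin{align*}
n - 2\log n &< \mathrm{soph}(x_0)
  && \text{(choice of $x_0$)} \\
&\stackrel{+}{<} K(x_0)
  && \text{(Lemma 6.4)} \\
&\stackrel{+}{<} K(n)
  && \text{($x_0$ is computed from $n$)} \\
&\stackrel{+}{<} \log n + 2\log\log n .
\end{align*}
```

The left-hand side grows linearly in $n$ and the right-hand side only logarithmically, so the chain fails for all sufficiently large $n$.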

The halting sequence $\chi = \chi_1 \chi_2 \cdots$ is the infinite binary
characteristic sequence of the halting problem, defined by
$\chi_i = 1$ if the reference universal prefix Turing machine $U$
halts on the $i$th input: $U(i) < \infty$, and $\chi_i = 0$ otherwise.

Lemma 8.2: Let $f^*$ be a total recursive function sufficient
statistic of $x$.

(i) We can compute $K(x)$ from $f^*$ and $x$, up to fixed
constant precision, which implies that $K(K(x) \mid f^*, x) \stackrel{+}{=} 0$.

(ii) If also $K(l_x(f) \mid f^*) \stackrel{+}{=} 0$, then we can compute $K(x)$
from $f^*$, up to fixed constant precision, which implies that
$K(K(x) \mid f^*) \stackrel{+}{=} 0$.

Proof: (i) Since $f$ is total, we can run $f(e)$ on all strings
$e$ in lexicographical length-increasing order. Since $f$ is total
we will find a shortest string $e_0$ such that $f(e_0) = x$. Set
$l_x(f) = l(e_0)$. Since $l(f^*) = K(f)$, and by assumption,
$K(f) + l_x(f) \stackrel{+}{=} K(x)$, we can now compute $K(x)$ up to an
additive constant.

(ii) follows from item (i). ■

Theorem 8.3: Suppose we are given an oracle that on query $x$
answers with a sufficient statistic $f^*$ of $x$ and a
$c_x \stackrel{+}{=} 0$ as required below.
Then, we can compute the Kolmogorov complexity function
$K$ and the halting sequence $\chi$.

Proof: By Lemma 8.2 we can compute the function
$K(x)$, up to fixed constant precision, given the oracle (without
the value $c_x$) in the statement of the theorem. Let $c_x$ in
the statement of the theorem be the difference between the
computed value and the actual value of $K(x)$. In [16], Exercise
2.2.7 on p. 175, it is shown that if we can solve the halting
problem for plain Turing machines, then we can compute the
(plain) Kolmogorov complexity, and vice versa. The same
holds for the halting problem for prefix Turing machines and
the prefix Turing complexity. This proves the theorem. ■

Lemma 8.4: There is a constant $c$, such that for every $x$
there is a program (possibly depending on $x$) of at most $c$
bits that computes $\mathrm{soph}(x)$ and the witness program $f$ from
$x, K(x)$. That is, $K(f \mid x, K(x)) \stackrel{+}{=} 0$. With some abuse of
notation we can express this as $K(\mathrm{soph} \mid K) \stackrel{+}{=} 0$.

Proof: By definition of sufficient statistic $f^*$, we have
$K(f) + l_x(f) \stackrel{+}{=} K(x)$. By (V.2) the number of sufficient
statistics for $x$ is bounded by an independent constant, and
we can generate all of them from $x$ by a program of length
$\stackrel{+}{=} 0$ (possibly depending on $x$). Then, we can simply determine
the least length of a sufficient statistic, which is $\mathrm{soph}(x)$. ■

There is a subtlety here: Lemma 8.4 is nonuniform. While
for every $x$ we only require a fixed number of bits to compute
the sophistication from $x, K(x)$, the result is nonuniform in
the sense that these bits may depend on $x$. Given a program,
how do we verify if it is the correct one? Trying all programs
of length up to a known upper bound, we don't know if
they halt or, if they halt, whether they halt with the correct
answer. The question arising is if there is a single program that
computes the sophistication and its witness program for all $x$.
In [21] this much more difficult question is answered in a
strong negative sense: there is no algorithm that for every $x$,
given $x, K(x)$, approximates the sophistication of $x$ to within
precision $l(x)/(10 \log l(x))$.

Theorem 8.5: For every $x$ of length $n$, and $f^*$ the program
that witnesses the sophistication of $x$, we have
$K(f^* \mid x) \stackrel{+}{<} \log n$. For every length $n$, there are strings
$x$ of length $n$, such that
$K(f^* \mid x) \stackrel{+}{>} \log n - \log \log n$.

Proof: Let $f^*$ witness the $\mathrm{soph}(x)$: That is,
$K(f) + l_x(f) \stackrel{+}{=} K(x)$, and $l(f^*) = \mathrm{soph}(x)$. Using the
conditional version of (III.4), see [6], we find that

$$K(K(x), f^* \mid x) \stackrel{+}{=} K(K(x) \mid x) + K(f^* \mid K(x), K(K(x) \mid x), x)$$

$$\stackrel{+}{=} K(f^* \mid x) + K(K(x) \mid f^*, K(f^* \mid x), x).$$

In Lemma 8.2 item (i), we show $K(K(x) \mid x, f^*) \stackrel{+}{=} 0$, hence
also $K(K(x) \mid f^*, K(f^* \mid x), x) \stackrel{+}{=} 0$. By Lemma 8.4,
$K(f^* \mid K(x), x) \stackrel{+}{=} 0$, hence also
$K(f^* \mid K(x), K(K(x) \mid x), x) \stackrel{+}{=} 0$.
Substitution of the constant terms in the displayed equation
shows

$$K(K(x), f^* \mid x) \stackrel{+}{=} K(f^* \mid x) \stackrel{+}{=} K(K(x) \mid x) \stackrel{+}{=} K(x^* \mid x). \qquad \text{(VIII.2)}$$

This shows that the shortest program to retrieve $f^*$ from $x$
is essentially the same program as to retrieve $x^*$ from $x$ or
$K(x)$ from $x$. Using (VIII.1), this shows that

$$\log l(x) \stackrel{+}{>} \limsup_{l(x) \to \infty} K(f^* \mid x) \stackrel{+}{>} \log l(x) - \log \log l(x).$$

Since $f^*$ is the witness program for $l(f^*) = \mathrm{soph}(x)$, we have
$l(f^*) \stackrel{+}{=} K(f^*) \stackrel{+}{>} K(f^* \mid x)$. ■

Definition 8.6: A function $f$ from the rational numbers
to the real numbers is upper semicomputable if there is a
recursive function $H(x,t)$ such that $H(x,t+1) \le H(x,t)$
and $\lim_{t \to \infty} H(x,t) = f(x)$. Here we interpret the total
recursive function $H(\langle x, t \rangle) = \langle p, q \rangle$ as a function from pairs
of natural numbers to the rationals: $H(x,t) = p/q$. If $f$ is
upper semicomputable, then $-f$ is lower semicomputable.
If $f$ is both upper and lower semicomputable, then it is
computable.
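
The role of the approximation $H(x,t)$ can be illustrated with a toy step-bounded search. The sketch below rests on assumptions that are not in the text: the list `PROGRAMS`, its description lengths, and the trivial upper bound `2 * len(x) + 1` are all hypothetical stand-ins for the programs of a universal machine. It shows only the mechanism: running every program for at most $t$ steps gives an estimate that can go down, never up, as $t$ grows.

```python
def slow_constant(output, delay):
    """Toy 'program': performs `delay` idle steps, then outputs `output`."""
    def run():
        for _ in range(delay):
            yield None            # one computation step, no output yet
        yield output
    return run

def loop_forever():
    """Toy 'program' that never halts."""
    while True:
        yield None

# (description length in bits, program); the values are hypothetical.
PROGRAMS = [
    (3, slow_constant("0101", delay=50)),
    (5, slow_constant("0101", delay=2)),
    (2, loop_forever),
]

def H(x, t):
    """Shortest known description length of x after running each program for at most t steps."""
    best = 2 * len(x) + 1         # assumed trivial upper bound, used before anything halts
    for length, program in PROGRAMS:
        for step, out in enumerate(program()):
            if step >= t:
                break             # step budget exhausted for this program
            if out is not None:
                if out == x:
                    best = min(best, length)
                break             # program halted
    return best

values = [H("0101", t) for t in range(60)]
```

In this toy setting the limit is reached after finitely many steps and is visible; for $K$ itself there is no computable way to know when the approximation has stopped decreasing, which is why $K$ is upper semicomputable but not computable.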

Recursive functions are computable functions over the natural
numbers. Since $K(\cdot)$ is upper semicomputable, [16], and from
$K(\cdot)$ we can compute $\mathrm{soph}(x)$, we have the following:

Lemma 8.7: (i) The function $\mathrm{soph}(x)$ is not computable to
any significant precision.

(ii) Given an initial segment of length $2^{2l(x)}$ of the halting
sequence $\chi = \chi_1 \chi_2 \cdots$, we can compute $\mathrm{soph}(x)$ from $x$. That
is, $K(\mathrm{soph}(x) \mid x, \chi_1 \ldots \chi_{2^{2l(x)}}) \stackrel{+}{=} 0$.

Proof: (i) The fact that $\mathrm{soph}(x)$ is not computable to any
significant precision is shown in [21].

(ii) We can run $U(p, d)$ for all (program, argument) pairs
such that $l(p) + l(d) \le 2l(x)$. (Not $l(x)$ since we are dealing
with self-delimiting programs.) If we know the initial segment
of $\chi$, as in the statement of the lemma, then we know which
(program, argument) pairs halt, and we can simply compute
the minimal value of $l(p) + l(d)$ for these pairs. ■

IX. Discussion 

"Sophistication" is the algorithmic version of "minimal 
sufficient statistic" for data $x$ in the model class of total
recursive functions. However, the full stochastic properties of the
data can only be understood by considering the Kolmogorov
structure function $\lambda_x(\alpha)$ (mentioned earlier) that gives the
length of the shortest two-part code of $x$ as a function of
the maximal complexity $\alpha$ of the total function supplying the
model part of the code. This function has value about $l(x)$
for $\alpha$ close to 0, is nonincreasing, and drops to the line $K(x)$
at complexity $\alpha_0 = \mathrm{soph}(x)$, after which it remains constant,
$\lambda_x(\alpha) = K(x)$ for $\alpha > \alpha_0$, everything up to a logarithmic
additive term. A comprehensive analysis, including many more
algorithmic properties than are analyzed here, has been given
in [21] for the model class of finite sets containing $x$, but it
is shown there that all results extend to the model class of
computable probability distributions and the model class of
total recursive functions, up to an additive logarithmic term.